{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = 'all'  # default is ‘last_expr'\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('/home/mink/notebooks/CameraTraps')  # append this repo to PYTHONPATH\n",
    "sys.path.append('/home/mink/lib/ai4eutils')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "from collections import Counter, defaultdict\n",
    "from random import sample\n",
    "from shutil import copyfile\n",
    "from multiprocessing.pool import ThreadPool\n",
    "\n",
    "from tqdm import tqdm\n",
    "from unidecode import unidecode \n",
    "\n",
    "from data_management.megadb.schema import sequences_schema_check\n",
    "from data_management.megadb.converters.cct_to_megadb import make_cct_embedded, process_sequences, write_json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# wellington_nz\n",
    "\n",
    "This notebook is a template for how new datasets can be formatted for ingestion into the database.\n",
    "\n",
    "The ideal dataset has both **location** and **sequence** information, in addition to any species or bounding box labels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Give the path to a JSON file where output from this script will be written to. You can then take this file to the .Net app for ingestion to the database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_name = 'wellington_nz'\n",
    "\n",
    "container_root = '/mink_disk_0/camtraps/wellington_nz'  # AzCopied the container to data disk\n",
    "path_prefix = 'images'\n",
    "\n",
    "path_to_output = f'/home/mink/camtraps/data/megadb_jsons/{dataset_name}.json' \n",
    "path_to_output_temp = f'/home/mink/camtraps/data/megadb_jsons/{dataset_name}_temp.json' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 0 - Add an entry to the `datasets` table\n",
    "\n",
    "Done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Prepare the `sequence` objects to insert into the database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1a - If you have metadata in COCO Camera Traps (CCT) format already...\n",
    "\n",
    "For a dataset, you probably have one or two JSONs in the CCT format, one containing image-level species labels and another containing bounding box annotations. Here we combine them and embed any annotation items into the image items."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# path to the CCT json, or a loaded json object\n",
    "path_to_image_cct = '/home/mink/camtraps/data/prev_labels/wellington_camera_traps.json'  # set to None if not available\n",
    "path_to_bbox_cct = None # set to None if not available\n",
    "assert not (path_to_image_cct is None and path_to_bbox_cct is None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might like to process the resulting `embedded` dataset a little more:\n",
    "\n",
    "- For image entries that do not have species label but have bounding box annotations, you can add a `species` field to the `annotations` field of each item in the list `embedded`, according to the `category` field of the first `bbox` item in `annotations`:\n",
    "    - If `e['annotations']['bbox'][category']` is `person`, assign `['human']` to `e['annotations']['species']`. Note that it needs to be a list (of one item).\n",
    "    - If `animal`, assign `['unidentified']`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading image DB...\n",
      "Number of items from the image DB: 270450\n",
      "Number of images with more than 1 species: 0 (0.0% of image DB)\n",
      "No bbox DB provided.\n",
      "CPU times: user 2.38 s, sys: 455 ms, total: 2.83 s\n",
      "Wall time: 3.1 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "embedded = make_cct_embedded(image_db=path_to_image_cct, bbox_db=path_to_bbox_cct)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following step, properties will be moved to the highest level that is still correct, i.e. if a property at the image-level always has the smae value for all images in a sequence, it will be moved to be a sequence-level property.\n",
    "\n",
    "If a sequence-level property has the same value throughout this dataset (often 'rights holder'), it will be removed from the `sequence` objects. A message about this will be printed, and you should add that property and its (constant) value to this dataset's entry in the `datasets` table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The dataset_name is set to wellington_nz. Please make sure this is correct!\n",
      "Making a deep copy of docs...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " 15%|█▌        | 41468/270450 [00:00<00:00, 414594.40it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Putting 270450 images into sequences...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 270450/270450 [00:00<00:00, 527698.43it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of sequences: 90478\n",
      "Checking the location field...\n",
      "Checking which fields in a CCT image entry are sequence-level...\n",
      "\n",
      "all_img_properties\n",
      "{'class', 'file', 'site', 'frame_num', 'camera', 'id', 'datetime'}\n",
      "\n",
      "img_level_properties\n",
      "{'id', 'frame_num', 'file'}\n",
      "\n",
      "image-level properties that really should be sequence-level\n",
      "{'class', 'camera', 'site', 'datetime'}\n",
      "\n",
      "Finished processing sequences.\n",
      "Example sequence items:\n",
      "\n",
      "{\"dataset\": \"wellington_nz\", \"seq_id\": \"2\", \"images\": [{\"id\": \"290716114012001a1116\", \"frame_num\": 0, \"file\": \"290716114012001a1116.JPG\"}, {\"id\": \"290716114014001a1114\", \"frame_num\": 1, \"file\": \"290716114014001a1114.JPG\"}, {\"id\": \"290716114014001a1115\", \"frame_num\": 2, \"file\": \"290716114014001a1115.JPG\"}], \"class\": [\"bird\"], \"datetime\": \"7/29/2016 11:40\", \"camera\": \"111\", \"site\": \"001a\"}\n",
      "\n",
      "{\"dataset\": \"wellington_nz\", \"seq_id\": \"40003\", \"images\": [{\"id\": \"030716222820038as242\", \"frame_num\": 0, \"file\": \"030716222820038as242.JPG\"}, {\"id\": \"030716222820038as243\", \"frame_num\": 1, \"file\": \"030716222820038as243.JPG\"}, {\"id\": \"030716222822038as241\", \"frame_num\": 2, \"file\": \"030716222822038as241.JPG\"}], \"class\": [\"ship rat\"], \"datetime\": \"7/3/2016 22:28\", \"camera\": \"s24\", \"site\": \"038a\"}\n",
      "\n",
      "CPU times: user 8.17 s, sys: 253 ms, total: 8.42 s\n",
      "Wall time: 8.44 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "sequences = process_sequences(embedded, dataset_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "215"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# only addition step: change the 'site' attribute to 'location' to be consistent with other datasets\n",
    "locations = set()\n",
    "for seq in sequences:\n",
    "    seq['location'] = seq['site']\n",
    "    del seq['site']\n",
    "    \n",
    "    locations.add(seq['location'])\n",
    "\n",
    "len(locations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[OrderedDict([('dataset', 'wellington_nz'),\n",
       "              ('seq_id', '36962'),\n",
       "              ('images',\n",
       "               [{'id': '190116070904022c5351',\n",
       "                 'frame_num': 0,\n",
       "                 'file': '190116070904022c5351.JPG'},\n",
       "                {'id': '190116070904022c5352',\n",
       "                 'frame_num': 1,\n",
       "                 'file': '190116070904022c5352.JPG'},\n",
       "                {'id': '190116070906022c5353',\n",
       "                 'frame_num': 2,\n",
       "                 'file': '190116070906022c5353.JPG'}]),\n",
       "              ('class', ['bird']),\n",
       "              ('datetime', '1/19/2016 7:09'),\n",
       "              ('camera', '535'),\n",
       "              ('location', '022c')]),\n",
       " OrderedDict([('dataset', 'wellington_nz'),\n",
       "              ('seq_id', '50836'),\n",
       "              ('images',\n",
       "               [{'id': '050116122908030bs101',\n",
       "                 'frame_num': 0,\n",
       "                 'file': '050116122908030bs101.JPG'},\n",
       "                {'id': '050116122908030bs102',\n",
       "                 'frame_num': 1,\n",
       "                 'file': '050116122908030bs102.JPG'},\n",
       "                {'id': '050116122908030bs103',\n",
       "                 'frame_num': 2,\n",
       "                 'file': '050116122908030bs103.JPG'}]),\n",
       "              ('class', ['empty']),\n",
       "              ('datetime', '1/5/2016 12:29'),\n",
       "              ('camera', 's10'),\n",
       "              ('location', '030b')])]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sample some sequences to make sure they are what you expect\n",
    "sample(sequences, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Pass the schema check\n",
    "\n",
    "Once your metadata are in the MegaDB format for `sequence` items, we check that they conform to the format's schema.\n",
    "\n",
    "If the format conforms, the following messages will be printed:\n",
    "\n",
    "```\n",
    "Verified that the sequence items meet requirements not captured by the schema.\n",
    "Verified that the sequence items conform to the schema.\n",
    "```\n",
    "\n",
    "For large datasets, the second step will take some time (~ a minute). \n",
    "\n",
    "Otherwise there will be an error message describing what's wrong. Please fix the issues until all checks are passed. You might need to write some snippets of code to loop through the `sequence` items to understand which entries have problems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Verified that the sequence items meet requirements not captured by the schema.\n",
      "Verified that the sequence items conform to the schema.\n",
      "CPU times: user 14 s, sys: 379 ms, total: 14.4 s\n",
      "Wall time: 14.4 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "sequences_schema_check.sequences_schema_check(sequences)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(path_to_output_temp, 'w', encoding='utf-8') as f:\n",
    "    json.dump(sequences, f, indent=1, ensure_ascii=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2b - copy images to flat folder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(path_to_output_temp) as f:\n",
    "    sequences = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "90478"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(sequences)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def copy_file(src_path, dst_path):\n",
    "    return copyfile(src_path, dst_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 90478/90478 [00:01<00:00, 80242.79it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.14 s, sys: 18.6 ms, total: 1.16 s\n",
      "Wall time: 1.15 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "path_pairs = []\n",
    "for seq in tqdm(sequences):\n",
    "    seq_id = seq['seq_id']\n",
    "    for im in seq['images']:\n",
    "        src_path = os.path.join(container_root, path_prefix, im['file'])\n",
    "        # assert os.path.exists(src_path), src_path\n",
    "        frame = im['frame_num']\n",
    "        dst_path = os.path.join('/mink_disk_0/camtraps/imerit12', \n",
    "                                f'{dataset_name}.seq{seq_id}.frame{frame}.jpg')\n",
    "        path_pairs.append((src_path, dst_path))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('/mink_disk_0/camtraps/wellington_nz/images/290116054720002c8232.JPG',\n",
       " '/mink_disk_0/camtraps/imerit12/wellington_nz.seq551.frame1.jpg')"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path_pairs[1000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 32s, sys: 5min 35s, total: 7min 8s\n",
      "Wall time: 36min 25s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "with ThreadPool(4) as pool:\n",
    "    dst_paths = pool.starmap(copy_file, path_pairs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "270450"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(dst_paths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Moved the images to `imerti12a`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 90478/90478 [00:00<00:00, 486512.92it/s]\n"
     ]
    }
   ],
   "source": [
    "# Oops, delete images that are labeled empty\n",
    "\n",
    "empty_images = []\n",
    "\n",
    "for seq in tqdm(sequences):\n",
    "    seq_id = seq['seq_id']\n",
    "    if len(seq['class']) == 1 and seq['class'][0] == 'empty':\n",
    "        for im in seq['images']:\n",
    "            frame = im['frame_num']\n",
    "            dst_path = os.path.join('/mink_disk_0/camtraps/imerit12a', \n",
    "                                f'{dataset_name}.seq{seq_id}.frame{frame}.jpg')\n",
    "            empty_images.append(dst_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "45721"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'/mink_disk_0/camtraps/imerit12a/wellington_nz.seq236.frame1.jpg'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'/mink_disk_0/camtraps/imerit12a/wellington_nz.seq89955.frame2.jpg'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(empty_images)  # About 17% are empty\n",
    "empty_images[100]\n",
    "empty_images[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_file(p):\n",
    "    os.remove(p)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 666 ms, sys: 5.66 s, total: 6.32 s\n",
      "Wall time: 17.2 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "with ThreadPool(8) as pool:\n",
    "    dst_paths = pool.map(remove_file, empty_images)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:cameratraps] *",
   "language": "python",
   "name": "conda-env-cameratraps-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
